{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c8340fe748443685",
   "metadata": {},
   "source": [
    " # 通过实验管理追踪和对比PAI-QuickStart模型训练任务\n",
    "\n",
    "模型训练通常是一个需要多次尝试和实验的过程，开发者需要通过配置使用不同的数据集、训练超参，或是不同的预训练模型进行迭代，监控训练是否收敛，比对多次训练任务的指标，从而选择出效果更好的模型，这通常依赖于使用实验管理工具来实现。\n",
    "\n",
    "[TensorBoard](https://www.tensorflow.org/tensorboard?hl=zh-cn)是一个常用的追踪和可视化工具，可以用于记录并展示模型训练过程中的损失和精度等标量数据。PAI提供TensorBoard服务，支持开发者在云上运行一个TensorBoard实例，监控训练。\n",
    "\n",
    "通过实验管理，开发者可以方便的在同一个TensorBoard实例上查看对比同一个实验下不同训练任务的指标，不再需要手动管理TensorBoard日志，\n",
    "\n",
    "本示例将以PAI-QuickStart提供的预训练模型的微调任务为例，演示如何通过PAI Python SDK使用PAI提供的实验能力，来组织和对比模型微调任务指标。\n",
    "\n",
    "\n",
    "## 费用说明\n",
    "\n",
    "本示例将会使用以下云产品，并产生相应的费用账单：\n",
    "\n",
    "- PAI-DLC：运行训练任务，详细计费说明请参考[PAI-DLC计费说明](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc)\n",
    "- OSS：存储训练任务输出的模型、训练代码、TensorBoard日志等，详细计费说明请参考[OSS计费概述](https://help.aliyun.com/zh/oss/product-overview/billing-overview)\n",
    "\n",
    "\n",
    "> 通过参与云产品免费试用，使用**指定资源机型**提交训练作业或是部署推理服务，可以免费试用PAI产品，具体请参考[PAI免费试用](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users)。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e80489c7d269c66",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-05-10T10:53:39.498381Z",
     "start_time": "2024-05-10T10:53:39.467495Z"
    }
   },
   "source": [
    "## 安装和配置SDK\n",
    "\n",
    "我们需要安装PAI Python SDK以运行本示例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8064d95fb7663d96",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install --upgrade pai"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e92e2e988d951a",
   "metadata": {},
   "source": [
    "SDK需要配置访问阿里云服务需要的AccessKey，以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后，通过在 **命令行终端** 中执行以下命令，按照引导配置密钥、工作空间等信息。\n",
    "\n",
    "```shell\n",
    "\n",
    "# 以下命令，请在 命令行终端 中执行.\n",
    "\n",
    "python -m pai.toolkit.config\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bc2facf0611217c",
   "metadata": {},
   "source": [
    "我们可以通过以下代码验证配置是否已生效。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70dacf7e9f406070",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pai\n",
    "from pai.session import get_default_session\n",
    "\n",
    "print(pai.__version__)\n",
    "\n",
    "sess = get_default_session()\n",
    "\n",
    "# 获取配置的工作空间信息\n",
    "assert sess.workspace_name is not None\n",
    "print(sess.workspace_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb8ce4373d3e9903",
   "metadata": {},
   "source": [
    "## 创建实验\n",
    "\n",
    "首先,我们需要创建一个实验。指定实验名称和输出路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b5310a914fd9fa6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.experiment import Experiment\n",
    "\n",
    "# 指定实验名称，同一个工作空间中，实验名称必须是唯一的\n",
    "experiment_name = \"test_experiment3\"\n",
    "\n",
    "# 使用工作空间默认Bucket与实验名称组合作为实验的输出路径，如果需要指定其他路径，请修改。\n",
    "# 目前仅支持OSS，请确保拥有对应Bucket的读写权限。\n",
    "default_bucket_name = sess.oss_bucket_name\n",
    "endpoint = sess.oss_endpoint\n",
    "artifact_uri = f\"oss://{default_bucket_name}.{endpoint}/{experiment_name}/\"\n",
    "\n",
    "# 创建实验\n",
    "experiment = Experiment.create(name=experiment_name, artifact_uri=artifact_uri)\n",
    "\n",
    "# 查看实验ID\n",
    "print(experiment.experiment_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0b7ef42c7a641bc",
   "metadata": {},
   "source": [
    "查看实验默认的TensorBoard日志存储路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9ab341ec875b0c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(experiment.tensorboard_data())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4ce9a1a54d40663",
   "metadata": {},
   "source": [
    "## 提交训练任务到实验"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "781934b47fce338d",
   "metadata": {},
   "source": [
    "PAI-QuickStart提供了大量预训练模型，包括生成式AI、计算机视觉等多个方向及领域。我们可以通过PAI-Python-SDK获取模型列表并对其进行训练，详细的操作方式请参考：https://gallery.pai-ml.com/#/preview/paiPythonSDK/training/pretrained_model。\n",
    "\n",
    "在本示例中，我们将以`Bert`模型为例，展示如何使用实验聚合多个模型微调训练任务并对比其训练指标。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdaaa75ab4a3f9d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.model import RegisteredModel\n",
    "import json\n",
    "from pai.estimator import AlgorithmEstimator\n",
    "from pai.experiment import ExperimentConfig\n",
    "\n",
    "# 获取PAI模型仓库中名称为bert-base-uncased模型\n",
    "m = RegisteredModel(\n",
    "    model_name=\"bert-base-uncased\",\n",
    "    model_provider=\"pai\",\n",
    ")\n",
    "\n",
    "# 通过注册模型的配置，获取相应的预训练算法\n",
    "est: AlgorithmEstimator = m.get_estimator(\n",
    "    # 指定训练机器的规格\n",
    "    instance_type=\"ecs.gn7i-c8g1.2xlarge\",\n",
    "    experiment_config=ExperimentConfig(\n",
    "        experiment_id=experiment.experiment_id,\n",
    "    ),\n",
    ")\n",
    "\n",
    "# 查看算法的超参定义\n",
    "print(json.dumps(est.hyperparameter_definitions, indent=4))\n",
    "\n",
    "# 查看算法默认的超参信息\n",
    "print(\"before\")\n",
    "print(est.hyperparameters)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "852a312331eeabb6",
   "metadata": {},
   "source": [
    "修改算法的超参配置,提交训练任务。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78be8d8cd6a4c3df",
   "metadata": {},
   "outputs": [],
   "source": [
    "est.set_hyperparameters(max_epochs=3, learning_rate=1.1e-5, save_step=10)\n",
    "\n",
    "print(\"after\")\n",
    "print(est.hyperparameters)\n",
    "\n",
    "# 获取默认训练输入\n",
    "default_inputs = m.get_estimator_inputs()\n",
    "\n",
    "# 创建训练任务\n",
    "est.fit(inputs=default_inputs, wait=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "401a7fecbf2b2b23",
   "metadata": {},
   "source": [
    "再一次修改超参数配置，提交新的训练任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5046db12dfbbce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 调整超参配置\n",
    "est.set_hyperparameters(max_epochs=4, learning_rate=1.2e-5)\n",
    "\n",
    "print(\"after\")\n",
    "print(est.hyperparameters)\n",
    "\n",
    "# 创建新的训练任务\n",
    "est.fit(inputs=default_inputs, wait=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3fbfeaf973f8cdc",
   "metadata": {},
   "source": [
    "## 通过实验的TensorBoard对比训练指标"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67bdba17310bb254",
   "metadata": {},
   "source": [
    "我们可以使用PAI TensorBoard服务，实时的查看实验中所有任务的TensorBoard日志。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99c4fe6769b7dd95",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 启动实验的TensorBoard应用\n",
    "tensorboard = experiment.tensorboard()\n",
    "\n",
    "# 查看TensorBoard的应用URL\n",
    "print(tensorboard.app_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daf7e6b75ae04d8f",
   "metadata": {},
   "source": [
    "使用完成之后，删除TensorBoard应用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4891040e57d185e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "tensorboard.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f603c1ec6aaed86b",
   "metadata": {},
   "source": [
    "我们也可以使用本地拉起的TensorBoard服务来查看TensorBoard日志。注意，TensorBoard日志是随着任务的运行不断写出的，日志文件会不断更新，需要下载最新的日志文件才能查看到最新的数据。\n",
    "\n",
    "首先需要在本地安装安装TensorBoard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4a355db36ffb93a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tensorboard"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbc051bdc1eced75",
   "metadata": {},
   "source": [
    "下载TensorBoard日志文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30b702d094c1b520",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.common import oss_utils\n",
    "from pai.common.oss_utils import OssUriObj\n",
    "\n",
    "oss_uri = OssUriObj(experiment.tensorboard_data())\n",
    "store_dir = \"./tensorboard_logs\"\n",
    "\n",
    "oss_utils.download(\n",
    "    oss_path=oss_uri.object_key,\n",
    "    local_path=store_dir,\n",
    "    bucket=sess.get_oss_bucket(oss_uri.bucket_name),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3aadda9f3942313c",
   "metadata": {},
   "source": [
    "通过shell命令在本地拉起TensorBoard服务。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b23e680e4e26d9cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!tensorboard --logdir \"$target_path\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
